Abstract: Automated random testing has been shown to be an 
effective approach to finding faults but still faces a major 
unsolved issue: how to generate test inputs diverse enough to 
find many faults and find them quickly. Stateful testing, the 
automated testing technique introduced in this article, generates 
new test cases that improve an existing test suite. The generated 
test cases are designed to violate the dynamically inferred 
contracts (invariants) characterizing the existing test suite. As 
a consequence, they are in a good position to detect new errors, 
and also to improve the accuracy of the inferred contracts by 
discovering those that are unsound. 

Experiments on 13 data structure classes totalling over 28,000 
lines of code demonstrate the effectiveness of stateful testing in 
improving over the results of long sessions of random testing: 
stateful testing found 68.4% new faults and improved the 
accuracy of automatically inferred contracts to over 99%, with 
just a 7% time overhead. 

I. Introduction 

Drawing inputs at random may sound like a desultory 
approach to testing, since it ignores any information about 
the structure of the system under test. This intuition, however, 
turns out to be largely flawed: there is now a compelling 
amount of evidence, both empirical [1] and analytical [2], 
showing that random testing is quite an effective testing 
technique that can uncover many subtle errors in real programs. 

When the tested software is equipped with contracts (pre 
and postconditions) random testing even becomes a completely 
automated technique: preconditions help select valid inputs 
and postconditions provide oracles to check if a test case 
exposes unexpected behavior that does not conform to the 
specification. The applications of random input generation are 
not limited to testing but extend to other dynamic software 
analysis techniques, such as inference of contracts [3] 
(improving and completing those written by programmers) and 
even automated program correction [4]. 

Constructing random inputs is straightforward for primitive 
types, such as integers and characters, where it boils down 
to drawing pseudo-random numbers. Constructing random 
objects of arbitrary classes is more involved, because objects 
can only be created and modified using a class' routines 
(methods). To approach this problem, random input generation 
algorithms for object-oriented languages maintain an object 
pool, which stores all objects randomly generated during the 
current testing session. The pool is populated either with fresh 
objects, built from scratch by creation procedures (constructors), 
or with objects returned by random routine calls on 
objects of appropriate type, randomly drawn from the pool. 
Routines and creation procedures with arguments are handled 
by recursively drawing from the pool conforming objects to 
be used as arguments. A test case is the combination of any 
target object in the pool with a routine applied to it. 
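
The pool-based generation described above can be sketched as follows. This is a minimal illustration in Python rather than Eiffel, and it is not AutoTest's implementation: the class `SimpleList`, the function `random_session`, and the 0.2 probability of creating a fresh object are all assumptions made for the example.

```python
import random


class SimpleList:
    """Hypothetical class under test, standing in for an Eiffel class."""

    def __init__(self):
        self.items = []

    def extend(self, v):
        self.items.append(v)

    def wipe_out(self):
        self.items.clear()

    def is_empty(self):
        return not self.items


def random_session(steps, seed=0):
    """Run `steps` random steps; return the object pool and the test cases."""
    rng = random.Random(seed)
    pool = [0, 1, -1, SimpleList()]  # primitive values plus one fresh object
    test_cases = []
    for _ in range(steps):
        if rng.random() < 0.2:
            pool.append(SimpleList())  # fresh object from a creation procedure
            continue
        # Draw a target of appropriate type from the pool.
        target = rng.choice([o for o in pool if isinstance(o, SimpleList)])
        routine = rng.choice(["extend", "wipe_out", "is_empty"])
        # Arguments are drawn from the pool among conforming objects.
        ints = [o for o in pool if type(o) is int]
        args = [rng.choice(ints)] if routine == "extend" else []
        result = getattr(target, routine)(*args)
        test_cases.append((target, routine, args))
        if result is not None:
            pool.append(result)  # returned objects join the pool
    return pool, test_cases
```

Each step either adds a fresh object or records one test case, so the pool only grows as the session runs, which is the effect the next paragraph discusses.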

Random testing sessions must last several hours to maximize 
error-finding effectiveness [1], [2]. A drawback of this 
necessity is that the object pool grows to contain a large num- 
ber of objects, even when duplicates are pruned. Therefore, 
the probability of generating at random test cases that would 
expose new bugs significantly decreases over time: the objects 
needed to generate the "missing" test cases may already be in 
the object pool, but they are unlikely to be drawn at random 
because they constitute only a small fraction of the whole pool. 

This paper presents stateful testing, a dynamic analysis 
technique that builds on top of random testing and magnifies 
its effectiveness. Stateful testing takes over where random 
testing gives up: after long sessions of random test case 
generation, the number of faults found reaches a plateau or 
grows sluggishly, and the object pool contains thousands of 
objects. At this point, stateful testing populates a database 
with the content of the pool stored as serialized objects; the 
database is searchable for objects that satisfy given predicates. 
For example, we can look up an object n of class INTEGER 
such that n > 0, or an object s of class SET that satisfies 
not s.is_empty (that is, the set contains at least one element). 
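
A searchable store of serialized objects of this kind might look like the sketch below. It uses Python's `sqlite3` and `pickle` modules purely for illustration; the two-table layout, the column names, and the predicate `is_positive` are assumptions of this example and do not reproduce the tool's actual schema.

```python
import pickle
import sqlite3


def build_database(pool):
    """Store each object serialized, together with one evaluated predicate."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE objects (oid INTEGER PRIMARY KEY, type TEXT, data BLOB)")
    db.execute("CREATE TABLE predicates (oid INTEGER, name TEXT, value INTEGER)")
    for obj in pool:
        cur = db.execute(
            "INSERT INTO objects (type, data) VALUES (?, ?)",
            (type(obj).__name__, pickle.dumps(obj)),
        )
        oid = cur.lastrowid
        if isinstance(obj, set):
            row = (oid, "is_empty", int(len(obj) == 0))
        else:  # this sketch only handles sets and integers
            row = (oid, "is_positive", int(obj > 0))
        db.execute("INSERT INTO predicates VALUES (?, ?, ?)", row)
    return db


def lookup(db, type_name, predicate, value):
    """Return deserialized objects of `type_name` where `predicate` is `value`."""
    rows = db.execute(
        "SELECT o.data FROM objects o JOIN predicates p ON o.oid = p.oid "
        "WHERE o.type = ? AND p.name = ? AND p.value = ?",
        (type_name, predicate, int(value)),
    )
    return [pickle.loads(data) for (data,) in rows]
```

With a pool such as `[3, -2, 0, set(), {1, 2}]`, the call `lookup(db, "set", "is_empty", False)` retrieves the non-empty set, mirroring the query for an object s with not s.is_empty.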

After populating the database, stateful testing runs dynamic 
contract inference [3] on all passing test cases generated 
during random testing; the result of this step is a collection of 
pre and postcondition clauses that summarize the properties 
of the test cases. Dynamic contract inference characterizes 
the passing test cases with pre and postconditions based on 
templates, which capture recurring usage patterns that lend 
themselves to "meaningful" generalization. For object-oriented 
programs, the set of public queries (functions) of a class often 
provides a valuable collection of predicates to be combined in 
templates; is_empty in the example above is a public query that 
often appears in contracts (inferred and programmer-written). 
Since the inference is based on a finite number of observations 
and on heuristics in the form of templates, some of the inferred 
contracts can be unsound: they merely are a reflection of the 
test cases that have been exercised. 

Stateful testing combines the information stored in the 
database of objects and the inferred contracts, with the goal 
of mutually enhancing the test suite and the contracts, along 
the lines of Xie and Notkin's proposal [5]. Stateful testing 
proceeds by systematically searching the database for objects 
that violate some of the inferred contracts and therefore 
enable the creation of new test cases. A new test case that 
executes successfully shows that an inferred contract can be 
violated without compromising execution, hence the contract 
is unsound and should be discarded. A new test case that 
triggers a failure exposes a fault overlooked in the previous 
testing session, corresponding to an input never tried before. 
Either way, the new test cases improve over the previous 
testing session by reaching regions of the object space 
previously unexplored. Take, for example, a routine wipe_out 
of class SET, which removes all the elements contained in the 
set. If wipe_out has always been called on empty sets, dynamic 
contract inference suggests the precondition is_empty. Stateful 
testing then selects an object s that violates the precondition, 
that is, such that not s.is_empty. If the call s.wipe_out 
succeeds, it shows that the inferred precondition s.is_empty is 
unsound and should be removed. If the call triggers a failure, 
it exposes a fault in the routine's implementation, which does 
not correctly handle sets that are not already empty. 

We implemented stateful testing within our AutoTest [6] 
framework for random testing of object-oriented Eiffel 
applications; the implementation is integrated in EVE [7], the 
freely available research branch of the EiffelStudio development 
environment. In an extensive set of experiments described in 
the paper, we applied stateful testing to the historical data 
generated by running AutoTest for 520 hours on 13 classes 
from the EiffelBase [8] and Gobo [9] data structure collections. 
Both libraries have a long development history and are widely 
used in the Eiffel community. AutoTest generated 149,293 
distinct test cases, exposed 95 faults in the libraries, and 
inferred hundreds of new contracts. We applied stateful testing 
for 36 hours on this massive data set. In this relatively limited 
amount of time, stateful testing exposed 65 new faults (68.4% 
improvement) and invalidated 39.3% of the inferred contracts; 
manual inspection reveals that almost all the retained contracts 
are sound. These figures are promising and demonstrate that 
stateful testing is an effective technique to boost the effective- 
ness of random testing and dynamic analysis. 

The rest of the paper is organized as follows: Section II 
gives an overview of stateful testing with a few examples; 
Section III describes the details of the technique; Section IV 
outlines the design of the relational database used to store 
the results of the initial dynamic analysis; Section V reports 
the experimental evaluation of stateful testing; Section VI 
discusses limitations and future work; Section VII presents 
related work; Section VIII concludes. 

II. Examples 

This section presents three detailed examples that demon- 
strate the applicability of stateful testing; the examples are 
from the libraries EiffelBase and Gobo. 



A. Unsound preconditions 

The first example shows how stateful testing can generate 
tests with better coverage and detect unsound preconditions. 
Class TWO_WAY_SORTED_SET is the standard Eiffel imple- 
mentation of sets with ordered elements. The class includes a 
public routine 

merge (other: TWO_WAY_SORTED_SET) 

which inserts all elements of other into the Current set (this 
in Java or C#). After running for 40 hours, AutoTest reports 
a dynamically inferred precondition for merge: 

pre_1: Current.disjoint (other) , 

indicating that it has only been called on disjoint sets 
($\mathit{Current} \cap \mathit{other} = \emptyset$), hence the functionality of merge has 
not been tested thoroughly. 

Stateful testing takes over from this situation and tries to 
generate new test cases that cover the deficiency. To this end, it 
searches the database (filled with data from hours of random 
testing) for objects of suitable type that violate pre_1; namely, 
it searches for two objects o1, o2 such that: 

(1) o1.type = TWO_WAY_SORTED_SET , 

(2) o2.type = TWO_WAY_SORTED_SET , 

(3) not o1.disjoint (o2) . 

Even if AutoTest never drew such objects during the 40-hour 
session, there are several pairs satisfying the three constraints 
(1-3) in the database. For every such pair of objects, stateful 
testing generates the new test case o1.merge (o2). 

Executing the new test cases improves the coverage of 
routine merge; it also reveals that the inferred precondition 
pre_1 is unsound and must be reduced, hence removing an 
error in the inferred contracts. In our experiments, the new 
test cases did not expose any faults in the implementation of 
merge. 

B. Unsound postconditions 

The second example shows how stateful testing can detect 
unsound dynamically inferred postconditions. Routine 
merge_left (other: LINKED_LIST) in class LINKED_LIST 
merges the content of other into the Current list. Extensive 
dynamic analysis reports, among others, the following 
postcondition for merge_left: 

post_2: old Current.is_equal (other) 
        implies Current.is_empty . 

That is, whenever Current and other contain the same elements 
(they are equal), they are actually empty lists. post_2 is 
unsound, as it merely reflects the fact that the test suite never 
ran merge_left on lists that are equal but not empty. 

Stateful testing targets the antecedent in the implication 
post_2, which refers to the state before executing merge_left 
by means of the old notation. The structure of the 
postcondition suggests exercising the routine on objects o1, o2 
where o1.is_equal (o2) holds in the pre-state but o1.is_empty 
does not, with the hope of showing that post_2's consequent 
does not hold after the call. Stateful testing creates a new test 
case o1.merge_left (o2) for every pair of objects in the database 
that satisfy the criteria. Since merge_left does not remove 
any element from the target o1, not o1.is_empty still holds 
after executing the test cases, thus invalidating post_2 and 
increasing the coverage of merge_left. 

C. Constructing new objects 

The third example shows how stateful testing can generate 
new objects by mutating other objects serialized in the 
database. The example targets class TWO_WAY_TREE, an 
implementation of trees with an arbitrary number of branches at 
each level. An object of type TWO_WAY_TREE encapsulates 
a tree's node; each node includes a list of references to its 
children (empty if the node is a leaf) and a cursor. The 
cursor is an iterator over the list of children, pointing to an 
element in the list or being off the list. Given two nodes 
n1, n2, we can merge n2's children into n1's by calling 
n1.merge_tree_after (n2): n2's list merges into n1's after the 
position marked by n1's cursor, as shown in Figure 1, where 
an arrow marks the position of the cursor, when it is not 
off. The position "after the cursor" is not defined if the cursor 
is off; developers wrote a precondition (require clause) to 
merge_tree_after to enforce this constraint on the input: 

merge_tree_after (other: TWO_WAY_TREE) 
    require not off 

where off is a Boolean query that holds when the cursor of the 
Current node is off (such as for nodes n0 and n2 in Figure 1). 

Fig. 1. Calling n1.merge_tree_after (n2) on the left tree results in the tree 
shown on the right. 

Dynamic analysis with AutoTest reports the dynamically 
inferred precondition for merge_tree_after: 

pre_3: not Current.is_sibling (other) . 

pre_3 reveals that merge_tree_after has never been tested 
with sibling nodes, that is, nodes at the same level of the 
tree (e.g., n1 and n2 in Figure 1). Correspondingly, stateful 
testing searches the database for objects violating pre_3, 
suitable to generate new test cases: two objects o1, o2 of 
type TWO_WAY_TREE that satisfy o1.is_sibling (o2) and 
not o1.off; the latter constraint is merge_tree_after's 
programmer-written precondition. 

Unfortunately, no pair of objects in the database satisfies 
all these constraints: there are several trees with sibling 
nodes, but all of them have their cursor off, hence 
merge_tree_after cannot be applied. In such situations, 
stateful testing selects available objects that satisfy some of 
the requirements and searches for routines that can mutate the 
object state to satisfy the missing requirements. The database 
also includes information on the behavior of routines, collected 
during dynamic analysis. 

In the running example, stateful testing searches for a 
routine of class TWO_WAY_TREE that can change a node 
where off is True to one where it is False. Routine start 
moves the cursor to the first child node (if the child list is not 
empty), hence it satisfies the search criteria. With this routine, 
stateful testing generates a new test case for merge_tree_after 
as follows. It selects two serialized objects o1, o2 of type 
TWO_WAY_TREE that are siblings; the test case consists of 
two consecutive calls: 

o1.start ; o1.merge_tree_after (o2) . 

In our experiments, this new test case triggered a failure, 
showing that merge_tree_after does not work correctly on 
sibling nodes. This fault went undetected in the random testing 
session, but stateful testing readily exposed it. 

III. How Stateful Testing Works 

This section starts with an overview of how stateful testing 
works (Section III-A), and then describes the details of the 
technique: what are the products of random testing 
(Section III-B), how stateful testing processes and organizes 
them (Section III-C), the role of dynamically inferred contracts 
(Section III-D), and their reduction to produce new test suites 
(Section III-E). 


A. Overview 

Figure 2 provides a bird's eye view of how stateful testing 
works. Stateful testing is a fully automated technique that 
produces new test cases from an existing test suite: 

1) Running AutoTest, the automatic random testing 
framework for Eiffel, for several hours produces a large 
pool of objects, and a test suite based on those objects. 

2) Stateful testing selects and extracts information from the 
object pool and the test suite and stores it in a relational 
database: the object/transition database. 

3) AutoInfer, the dynamic contract inference component 
of AutoTest, summarizes the behavior of the test cases 
in the test suite in the form of dynamically inferred 
contracts. 

4) The reduction phase extracts objects from the database 
that violate some of the inferred contracts. The extracted 
objects support the generation of a new test suite, which 
exercises the classes under test differently than in the 
original test suite. 

5) Executing the new test suite can uncover new faults in 
the code under test, and reveal which of the inferred 
contracts are incorrect and should be discarded. 

B. Preliminaries: test cases and objects 

A test case $t$ is the call of a routine $r$ on a target object 
$a_0$ with actual arguments $a_1, \ldots, a_m$ that returns an object $b$, 
denoted as: 

$$t = a_0.r\,(a_1, \ldots, a_m) : b .$$

Fig. 2. Overview of how stateful testing works. 



If $r$ is a command (procedure), which does not return any 
value, replace $b$ with the dummy object $\epsilon$; if $r$ is a creation 
procedure (constructor), which returns a fresh object, replace 
the target $a_0$ with $\epsilon$. 

Contracts are annotations using the same syntax as pro- 
gramming language Boolean expressions; they specify the 
behavior of routines through preconditions and postconditions. 
The precondition of a routine r is a predicate that r's target 
and arguments satisfy before the call; for example, pre_1 in 
Section II-A declares that the Current set (i.e., the target) 
and the other set (i.e., the argument) are disjoint, for every 
call of merge. The postcondition of a routine r is a predicate 
over r's result (if any), as well as r's target and arguments; 
postconditions can refer to targets and arguments both in 
the post-state (i.e., after the call) and in the pre-state (i.e., 
before the call, with the old keyword). For example, post_2 
in Section II-B specifies that, if the target list and the other 
list contained the same elements before a call to merge_left, 
then the target is empty after the call. In Eiffel, programmers can 
annotate routines with pre (require clause) and postconditions 
(ensure clause); stateful testing includes a contract inference 
phase that supplements the contracts written by programmers 
with inferred contracts. 

Contracts provide a criterion to determine if a test case is 
passing or failing completely automatically. A routine's test 
case $a_0.r\,(a_1, \ldots, a_m) : b$ is valid if its target and arguments 
$a_0, a_1, \ldots, a_m$ satisfy $r$'s precondition, and is invalid 
otherwise. Executing a valid test case $t$ changes the target and 
arguments into the post-state $a_0', a_1', \ldots, a_m'$, denoted 

$$\langle a_0', a_1', \ldots, a_m' \rangle .$$

$t$ is passing if executing the test case triggers no exceptions, 
and the post-state $\langle a_0', a_1', \ldots, a_m' \rangle$, the pre-state 
$\langle a_0, a_1, \ldots, a_m \rangle$ and the returned object $b$ satisfy $r$'s 
postcondition; otherwise, $t$ is failing. 
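
The classification of a test case as invalid, passing, or failing can be sketched as below. This is an illustrative Python model, not the tool's Eiffel implementation: the function `run_test_case` and the convention of passing contracts as plain functions are assumptions of the example.

```python
import copy


def run_test_case(target, routine, args, precondition, postcondition):
    """Classify one test case as 'invalid', 'passing', or 'failing'."""
    if not precondition(target, *args):
        return "invalid"  # target and arguments violate the precondition
    # Snapshot the pre-state so the postcondition can refer to `old` values.
    old_target, old_args = copy.deepcopy((target, args))
    try:
        result = routine(target, *args)
    except Exception:
        return "failing"  # any exception counts as a failure
    ok = postcondition(old_target, old_args, target, args, result)
    return "passing" if ok else "failing"
```

The postcondition receives both the saved pre-state and the mutated post-state, which corresponds to the use of the old keyword in Eiffel postconditions.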

Stateful testing builds upon an existing test suite that 
exercises a set of classes. A test suite is a collection 
$T = \{t_1, t_2, \ldots\}$ of test cases; it induces the set $O = \{o_1, o_2, \ldots\}$ 
of all objects mentioned in $T$'s test cases or in the post-state of 
passing test cases; $O$ is the object pool. Stateful testing works 



Listing 1. Routines of class LIST with contracts. 

make: LIST                 -- Create an empty list 
    ensure Result.is_empty 

wipe_out                   -- Remove all elements 
    ensure is_empty 

extend (v: ANY)            -- Add 'v' to the end 
    ensure has (v) 

append (other: LIST)       -- Append 'other' to the end 
    require other /= Void 

has (v: ANY): BOOLEAN      -- Does the list include 'v'? 

is_empty: BOOLEAN          -- Is the list empty? 



independently of how the object pool O and the test suite T 
are generated. Its implementation in the AutoTest framework, 
however, generates them completely automatically from a set 
of Eiffel classes with random testing. 

Example. The class LIST implements dynamic lists; it 
is modeled after real Eiffel classes, but is simplified for 
clarity. Listing 1 shows the signatures of LIST's routines with 
programmer-written contracts. Consider the test suite $T$: 

$$\begin{array}{ll}
t_1: & \epsilon.\mathit{make} : l_1 \;\langle \epsilon \rangle \\
t_2: & l_1.\mathit{wipe\_out} : \epsilon \;\langle l_1 \rangle \\
t_3: & l_1.\mathit{append}\,(l_1) : \epsilon \;\langle l_1, l_1 \rangle \\
t_4: & l_1.\mathit{extend}\,(l_1) : \epsilon \;\langle l_2, l_2 \rangle \\
t_5: & l_1.\mathit{is\_empty} : b_3 \;\langle l_1 \rangle \\
t_6: & \epsilon.\mathit{make} : l_4 \;\langle \epsilon \rangle
\end{array}$$

where all test cases are passing. For simplicity, we do not 
introduce new duplicate objects in $T$ when they are unchanged 
in the post-state with respect to the pre-state; for example, $t_4$ 
denotes a call to extend with $l_1$ as target and argument, and 
$l_2$ is the name given to the list after extending it, whereas 
$t_5$ does not change the target $l_1$, which is then repeated in 
the post-state. $T$ induces the object pool $O = \{l_1, l_2, b_3, l_4\}$, 
where $l_1, l_4$ are empty lists, $l_2$ has one element (a reference 
to $l_2$ itself), and $b_3$ is the Boolean True. 

C. Object/transition database 

The object/transition database contains detailed information 
about the objects in the object pool $O$. Section IV details 
how the database is implemented with relational database 
technology; the current section describes how stateful testing 
selects and extracts the information to store in the database. 

Abstract object states and transitions. The object/transition 
database stores all objects in the pool $O$ in 
serialized form. On top of the serialized objects, the database 
stores their abstract state, expressed in terms of the public 
queries (functions) of the objects. In the running example, 
class LIST has two public queries: is_empty and has. We 
would like to have information as extensive as possible in 
the database: for every combination of objects in the pool, 
evaluate every public query that is applicable. This is clearly 
infeasible for object pools of non-trivial size, hence stateful 
testing uses a heuristic based on the usage of objects in the 
test suite $T$. For an object $o \in O$, consider the set $\rho(o)$ of 
objects reachable by recursively following references among 
$o$'s attributes, and including $o$ itself; because of how the object 
pool is defined, $\rho(o)$ is a subset of $O$. Extend the notation to 
objects reachable from a set of objects: $\rho(O) = \bigcup_{o \in O} \rho(o)$. 
For every test case $t = a_0.r\,(a_1, \ldots, a_m) : b \;\langle a_0', \ldots, a_m' \rangle$, 
the database stores all the applicable public queries $q$: 

$$\alpha_0.q\,(\alpha_1, \ldots, \alpha_n) : \beta \qquad (1)$$

$$\omega_0.q\,(\omega_1, \ldots, \omega_n) : \psi \qquad (2)$$

where $\alpha_0, \alpha_1, \ldots, \alpha_n$ range over the set $\rho(a_0, a_1, \ldots, a_m)$ of 
objects reachable in $t$'s pre-state, and $\omega_0, \omega_1, \ldots, \omega_n$ range over 
the set $\rho(b, a_0', a_1', \ldots, a_m')$ of objects reachable in $t$'s 
post-state. Precisely, for every call of the form (1) or (2), the 
database adds the objects $\beta, \psi$ in serialized form and includes 
a tuple with $q$'s signature, and references to the serialized 
objects $\alpha_0, \ldots, \alpha_n, \omega_0, \ldots, \omega_n$, stored in the database. 
The current tool implementation supports queries of generic 
return type; for simplicity, the presentation in this paper only 
considers queries that return Boolean values. 

The database also stores information about transitions: each 
transition associates the routine $r$ with several pairs of query 
evaluations; the first element of the pair evaluates a query in 
the pre-state (1), and the second evaluates it in the post-state 
(2). Then, the transition represents the fact that calling $r$ when 
the pre-state holds can drive the object to the post-state. 

Continuing the running example (Listing 1), the test cases 
$t_1, t_2, t_3$ only mention the list object $l_1$, which produces 
the queries $l_1$.is_empty: True and $l_1$.has ($l_1$): False. $t_4$ 
introduces the object $l_2$, hence the new queries $l_2$.is_empty: False, 
$l_1$.has ($l_2$): False, $l_2$.has ($l_1$): False, $l_2$.has ($l_2$): True. $t_6$ 
introduces two more queries on $l_4$: $l_4$.is_empty: True and 
$l_4$.has ($l_4$): False. Finally, $t_4$ induces the only non-trivial 
transitions from a pre-state where is_empty evaluates to True 
and has to False, to a post-state where both queries change 
their returned value when evaluated on the changed target. 
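
To make the notion of abstract state and transition concrete, the sketch below evaluates the two public queries of the running example before and after a call. It is a simplified Python model: the class `List`, the helpers `abstract_state` and `record_transition`, and the dictionary encoding of states are assumptions of this example, not the database's actual representation.

```python
class List:
    """Simplified model of the paper's LIST class (illustrative only)."""

    def __init__(self):
        self.items = []

    def extend(self, v):
        self.items.append(v)

    def has(self, v):
        # Identity comparison avoids recursing into self-referencing lists.
        return any(v is x for x in self.items)

    def is_empty(self):
        return not self.items


def abstract_state(target, others):
    """Evaluate the public queries is_empty and has on the given objects."""
    state = {"is_empty": target.is_empty()}
    for name, obj in others.items():
        state[f"has({name})"] = target.has(obj)
    return state


def record_transition(routine_name, target, args, others):
    """Record query evaluations before and after calling the routine."""
    pre = abstract_state(target, others)
    getattr(target, routine_name)(*args)
    post = abstract_state(target, others)
    return {"routine": routine_name, "pre": pre, "post": post}
```

Calling `record_transition("extend", l, [l], {"self": l})` on a fresh list `l` reproduces the transition described above: is_empty goes from True to False and has from False to True.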

Public branch and path conditions. To increase the precision 
of the abstract states stored in the database, stateful testing 
includes the value of several Boolean expressions extracted 
from the program text. For every test case $t$ exercising a 
routine $r$, collect all the Boolean expressions $e_1, e_2, \ldots$ that 
appear as branch conditions or as path conditions in $r$'s 
control-flow graph, and that only reference public features (members) 
of $r$'s containing class. The rationale for storing branch and 
path conditions is that they often offer "interesting" partitions 
of the input states. The database stores the evaluations of 
these expressions for each applicable combination of objects 
reachable in the pre-state and in the post-state of every test $t$. 

D. Dynamic contract inference 

To get a concise characterization of the test suite $T$ in terms 
of class features, stateful testing performs contract inference 
with dynamic techniques. The implementation uses AutoInfer [3], 
the inference component of the AutoTest framework. 

Contract inference only considers the passing test cases 
from the suite $T$ and produces, for each routine $r$ exercised 
in the test suite, a list $\mathit{pre}(r)$ of preconditions and a list 
$\mathit{post}(r)$ of postconditions. These inferred contracts summarize 
$r$'s behavior with the test cases in $T$: for every passing test 
$t = a_0.r\,(a_1, \ldots, a_m) : b \;\langle a_0', \ldots, a_m' \rangle$ in $T$, the arguments 
$a_1, \ldots, a_m$ and the target $a_0$ satisfy all preconditions in $\mathit{pre}(r)$, 
and the result $b$ (if any) and post-state $a_0', a_1', \ldots, a_m'$ satisfy 
all postconditions in $\mathit{post}(r)$. 

In the running example (Listing 1), wipe_out is always 
invoked on an empty list, hence is_empty is an inferred 
precondition in $\mathit{pre}(\mathit{wipe\_out})$; append is invoked once on an empty 
list which is still empty after the call, hence old is_empty 
implies is_empty is a postcondition in $\mathit{post}(\mathit{append})$, and 
other.is_empty is a precondition in $\mathit{pre}(\mathit{append})$.¹ 
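
Template-based inference of preconditions can be illustrated with the following sketch, which keeps every candidate predicate that held in all observed pre-states. It is a deliberately naive Python model and makes no claim about how AutoInfer works internally; `TEMPLATES` and `infer_preconditions` are hypothetical names introduced here.

```python
# Candidate predicates over the target (and optional arguments) of a call.
TEMPLATES = {
    "is_empty": lambda target, *args: len(target) == 0,
    "not is_empty": lambda target, *args: len(target) > 0,
}


def infer_preconditions(observations, templates=TEMPLATES):
    """Return the names of templates that hold in every observed pre-state.

    `observations` is a list of (target, args) pairs taken from passing
    test cases of one routine.
    """
    if not observations:
        return []  # nothing observed, nothing inferred
    return [
        name
        for name, pred in templates.items()
        if all(pred(target, *args) for target, args in observations)
    ]
```

If wipe_out is only ever observed on empty lists, the sketch reports `["is_empty"]`; a single observation on a non-empty list is enough to drop it, which is exactly why such contracts may be unsound.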

The inferred contracts are typically different than those 
programmers write: the former tend to be more detailed and 
numerous than the latter, especially in the case of postcon- 
ditions, which programmers neglect but dynamic analysis 
is effective at reporting [1], [10]. Furthermore, dynamically 
inferred contracts have no guarantee of being correct: since 
they are based on a finite number of observations, they may 
merely be a reflection of a not sufficiently varied test suite, 
such as the two examples discussed in the previous paragraph. 

E. Reduction 

After building the object/transition database and collecting 
the inferred contracts, stateful testing generates a new test suite 
by precondition reduction. The basic idea is partitioning the 
input space: a predicate $p$ defines two regions, one where $p$ 
holds and one where it doesn't; a comprehensive test suite 
should cover every region, for every combination of 
"interesting" predicates, with at least one test case. This is clearly 
infeasible, because the predicates are too many; precondition 
reduction is a heuristic technique that considers a reduced 
number of partitions based on the inferred preconditions. 

¹ Dynamic inference does not really infer contracts based on so few test 
cases because they are statistically insignificant; the example is only for 
illustration purposes. 

1) Precondition reduction: The precondition reduction of a 
routine $r$ generates new inputs to test $r$ by trying to invalidate 
$r$'s inferred preconditions. Suppose $r$ has $m$ arguments, and 
let $\mathit{require}(r)$ denote $r$'s programmer-written preconditions. 
Select a dynamically inferred precondition $p$ from the set 
$\mathit{pre}(r)$ and build the predicate: 

$$\mathcal{R}_p = \neg p \wedge \mathit{require}(r) .$$

$\mathcal{R}_p$ characterizes objects that satisfy $r$'s programmer-written 
preconditions but violate the inferred $p$, hence they can be 
used to test $r$ in a way not covered by the existing test suite. 

Stateful testing searches the object/transition database for 
tuples of objects $(o_0, o_1, \ldots, o_m)$ that satisfy $\mathcal{R}_p$ (expressed 
as a conjunction of elementary expressions). In the running 
example (Listing 1), wipe_out's inferred precondition is_empty 
suggests searching for objects of type LIST satisfying not 
is_empty (wipe_out has no programmer-written precondition); 
$l_2$ satisfies the search criterion. 

For each tuple $(o_0, o_1, \ldots, o_m)$ retrieved in the search, 
stateful testing constructs the new test case 

$$t^{\mathrm{new}} = o_0.r\,(o_1, \ldots, o_m) .$$

In practice, there is a cut-off on the number of retrieved tuples 
(if they are too many, only a few are tried) and a time-out on 
the time spent searching the database (if no tuple is found by 
the time-out, we move to the next reduction). If $t^{\mathrm{new}}$ is passing, 
then the precondition $p$ is unsound and removed from $\mathit{pre}(r)$; 
if $t^{\mathrm{new}}$ is failing, a fault is found (and $p$ is also unsound). 
Since the information stored in the database is incomplete, 
$t^{\mathrm{new}}$ may also be invalid, in which case it is simply discarded. 
In the running example, the list $l_2$ is not empty and the test 
case $l_2$.wipe_out exercises wipe_out in ways not tested before. 
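
The reduction loop just described can be sketched as follows. This Python model is restricted to routines without arguments and replaces the database search with a scan of an in-memory pool; the function name `precondition_reduction` and the representation of contracts as plain functions are assumptions of the example, and cut-offs and time-outs are omitted.

```python
import copy


def precondition_reduction(routine, inferred, require, pool, postcondition):
    """Try to invalidate each inferred precondition of `routine`.

    Returns the inferred preconditions that survive and the faults found,
    each fault being a pair (precondition name, offending input).
    """
    surviving, faults = dict(inferred), []
    for name, p in inferred.items():
        # Candidates satisfy require(r) but violate the inferred p.
        candidates = [o for o in pool if require(o) and not p(o)]
        for obj in candidates:
            target = copy.deepcopy(obj)  # keep the pooled object intact
            try:
                routine(target)
                passed = postcondition(target)
            except Exception:
                passed = False
            surviving.pop(name, None)  # p is unsound whether t passes or fails
            if not passed:
                faults.append((name, obj))
    return surviving, faults
```

For wipe_out with inferred precondition is_empty, a pool containing one non-empty list is enough to discard the precondition; with a faulty implementation the same input is reported as a fault.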

2) Using transitions: If the search for objects satisfying 
$\mathcal{R}_p$ fails, stateful testing tries to retrieve objects satisfying a 
weaker predicate than $\mathcal{R}_p$, and then it searches for a transition 
that drives the objects to match the desired $\mathcal{R}_p$. To this end, 
put $\mathcal{R}_p$ in conjunctive normal form $c_1 \wedge \cdots \wedge c_n$ and select 
$1 \le d < n$ clauses to drop; without loss of generality, we drop 
$c_1, \ldots, c_d$ ($D$) and we keep $c_{d+1}, \ldots, c_n$ ($K$): 

$$\mathcal{R}_p = \underbrace{c_1 \wedge \cdots \wedge c_d}_{D} \wedge \underbrace{c_{d+1} \wedge \cdots \wedge c_n}_{K} .$$

For any tuple of objects $(o_0, \ldots, o_m)$ satisfying $K$, search the 
object/transition database for transitions that can transform 
a tuple $(o_0, \ldots, o_m)$ satisfying $\neg D$ into a tuple $(o_0', \ldots, 
o_m')$ satisfying $D$. Every such transition consists of a routine 
$s$ and a mapping $\mu : [0..n] \to [0..m]$, where $s$ has $n \ge 0$ 
arguments, and $\mu$ binds the objects $o_0, \ldots, o_m$ to $s$'s target and 
arguments: the $i$-th argument is instantiated with $o_{\mu(i)}$. For 
every such transition, construct the new test case 

$$t^{\mathrm{new}} = o_{\mu(0)}.s\,(o_{\mu(1)}, \ldots, o_{\mu(n)}) \;;\; o_0.r\,(o_1, \ldots, o_m),$$

consisting of two consecutive calls. 



In the TWO_WAY_TREE example in Section II-C, the 
dropped clause $D$ is not Current.off and the kept clause 
$K$ is Current.is_sibling (other). Two objects o1, o2 in the 
database satisfy o1.is_sibling (o2), and a transition suggests 
that routine start can change o1 from o1.off to not o1.off. 

The search for transitions is heuristic: since the information 
about transitions in the database is incomplete in general, the 
routine $s$ may be inapplicable to the objects $(o_0, \ldots, o_m)$, or 
it may not drive them to a state satisfying $\mathcal{R}_p$ (for example, 
because it invalidates $K$ as a side effect of satisfying $D$). In 
practice, the heuristic search is reasonably successful when 
there are objects whose state is close to satisfying $\mathcal{R}_p$; 
correspondingly, the current implementation drops at most one 
clause ($d = 1$), and does not build sequences of transitions 
with more than two calls. 
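
The use of transitions with a single dropped clause can be sketched as below, loosely following the TWO_WAY_TREE example. The sketch is an assumption-laden Python model: `Node`, `find_transition`, `build_with_transition`, and the dictionary format of recorded transitions are hypothetical, and only argument-less transition routines are handled.

```python
import copy


class Node:
    """Toy stand-in for a tree node with a cursor over its children."""

    def __init__(self, children):
        self.children = list(children)
        self.cursor = None  # None models a cursor that is off

    def off(self):
        return self.cursor is None

    def start(self):
        if self.children:
            self.cursor = 0  # move the cursor to the first child


def find_transition(transitions, clause, wanted):
    """Names of routines observed to change query `clause` to `wanted`."""
    return [
        t["routine"]
        for t in transitions
        if t["pre"].get(clause) != wanted and t["post"].get(clause) == wanted
    ]


def build_with_transition(candidates, kept, dropped, transitions, clause, wanted):
    """Yield (routine name, object) pairs where the routine established D."""
    for obj in candidates:
        if not kept(obj) or dropped(obj):
            continue  # need objects satisfying K but not yet D
        for name in find_transition(transitions, clause, wanted):
            trial = copy.deepcopy(obj)
            getattr(trial, name)()
            # The search is heuristic: re-check both K and D afterwards.
            if kept(trial) and dropped(trial):
                yield name, trial
```

A node without children illustrates why the final re-check is needed: calling `start` on it leaves the cursor off, so the candidate is silently discarded.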

3) Detecting unsound postconditions: Inferred postconditions 
can be unsound, too, but we cannot directly select objects 
that violate postconditions, because we do not have direct 
control over post-states. Precondition reduction, however, can 
also help to invalidate inferred postconditions, while testing 
routines more thoroughly. Consider an inferred postcondition 
$q$ in $\mathit{post}(r)$ in the form: 

$$q : \mathbf{old}(A) \Longrightarrow C .$$

We focus on postconditions in this form, because $q$ naturally 
expresses many postconditions where a property $C$ of the 
post-state is a consequence of a property $A$ of the pre-state (old). 
Invalidating the implication $q$ means producing test cases that 
start in a pre-state where $A$ holds and reach a post-state 
where $\neg C$ holds. The existing test suite does not include such 
test cases, otherwise $q$ would not be a dynamically inferred 
postcondition. 

The inferred preconditions, however, help select pre-states
that may challenge the validity of q. To this end, consider the
set $pre(r \mid A)$ of r's dynamically inferred preconditions that
hold when A also holds. Select a $p \in pre(r \mid A)$ among these
preconditions and build the predicate:

^^p^q ^ ^^p ' 

Then, select (or build with transitions) objects $(o_0, \ldots, o_m)$
that satisfy this predicate and generate the new test case $t^{new}$ that
calls r on $(o_0, \ldots, o_m)$ (as in Section III-E1). If $t^{new}$ is valid
and passing (with respect to r's programmer-written contracts
only) but C is false after executing it, the postcondition q is
unsound and is removed from $post(r)$; if $t^{new}$ is failing (again
with respect to r's programmer-written contracts, which are
always assumed correct), it also shows a fault.

In the example of Listing 1, stateful testing targets the
inferred postcondition old is_empty implies is_empty of routine
append, which is of the form q with A being is_empty; for the
same routine, other.is_empty is a precondition inferred when
A also holds. Hence, stateful testing looks for two lists, one
empty and one not; l1, l2 satisfy the criterion and yield the
new test case l1.append (l2).
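
The check performed on such a test case can be sketched in Python as follows. This is a minimal illustration under stated assumptions: Python's `list.extend` stands in for Eiffel's `append`, a raised `AssertionError` stands in for a violated programmer-written contract, and the helper `run_reduced_test` is hypothetical. Invalid test cases (those violating a programmer-written precondition), which would simply be discarded, are not modeled separately.

```python
from typing import Any, Callable


def run_reduced_test(routine: Callable[[Any, Any], None], target: Any, arg: Any,
                     antecedent: Callable[[Any], bool],
                     consequent: Callable[[Any], bool]) -> str:
    """Run one call and classify it against q: old(A) implies C."""
    a_held = antecedent(target)            # A is evaluated in the pre-state (old)
    try:
        routine(target, arg)
    except AssertionError:
        return "fault"                     # a programmer-written contract failed
    if a_held and not consequent(target):  # C is evaluated in the post-state
        return "unsound postcondition"
    return "no evidence"


def is_empty(lst: list) -> bool:
    return len(lst) == 0


def append(lst: list, other: list) -> None:
    lst.extend(other)  # assumption: stands in for Eiffel's LIST.append


l1, l2 = [], ["x"]     # one empty list and one non-empty list
print(run_reduced_test(append, l1, l2, is_empty, is_empty))
```

With an empty argument list the same call yields no evidence against q, which is why the existing test suite, where other.is_empty always held, could not refute it.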
IV. Object/transition database

Section III-C describes what kind of information the
object/transition database stores; the present section details 
how the database is implemented with relational technology. 



TestCases (tid, class, routine, pre_serialized, post_serialized)

Predicates_1 (tid, name, var_0, type_0, ret_value, kind)

...

Predicates_(n+1) (tid, name, var_0, type_0, ..., var_n, type_n, ret_value, kind)

Fig. 3. Relational schema of the object/transition database.

A. Relational schema 

Figure 3 shows the most significant parts of the
object/transition database's relational schema. The database
is centered around the test cases in the test suite T: for
each test case $a_0.r(a_1, \ldots, a_m) : b\ \langle a_0', \ldots, a_m' \rangle$, table
TestCases stores: (1) a unique identifier (attribute tid),
(2) r's class (class), (3) r's name (routine), (4) the list
$a_0, a_1, \ldots, a_m, a_{m+1}, \ldots$ of all serialized objects in the
pre-state followed by all objects reachable from them
(pre_serialized), (5) the list $a_0', a_1', \ldots, a_m', b, a_{m+1}', \ldots$ of all
serialized objects in the post-state followed by all objects
reachable from them (post_serialized). We do not discuss the
straightforward details of how lists of serialized objects are
encoded as sequences of characters with separators.

The other tables Predicates_1, Predicates_2, ...,
Predicates_9 store information about the abstract state
of objects, in the form of predicates over 1, 2, ..., 9
objects. Consider an atomic Boolean predicate q evaluated
over n + 1 objects, in the form $o_0.q(o_1, \ldots, o_n) : v$, that
holds for the pre-state of a test case with identifier x.
Table Predicates_(n+1) stores an entry with: (1) a reference
to the test case x (attribute tid), (2) the normalized textual
form of the predicate, obtained by replacing every reference
to objects with '$' placeholders as in '$.q($, ..., $)' (attribute
name), (3) for each object $o_i$, $0 \le i < n+1$, an integer $k_i$
such that the $k_i$-th element of the list in the pre_serialized attribute
of the test case with tid x contains $o_i$ (attribute var_i), (4) for
each object $o_i$, $0 \le i < n+1$, its dynamic type (type_i), (5) the
Boolean value v returned (attribute ret_value), (6) the constant
pre to denote that q is evaluated over the pre-state (attribute
kind). For predicates evaluated in the post-state, attribute kind
stores the constant post, and everything else is as for pre-states.
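
To make the encoding concrete, here is a small Python sketch that builds one such entry from a predicate evaluation. The function `predicate_row` and its argument layout are hypothetical and only mirror the six items listed above; the identifier 1 used in the example is arbitrary.

```python
from typing import List, Tuple


def predicate_row(tid: int, query: str, objects: List[Tuple[int, str]],
                  value: bool, kind: str) -> tuple:
    """Build a row for table Predicates_(n+1) from one predicate evaluation.

    `objects` lists, for the target and then each argument, the pair
    (index into the serialized list, dynamic type).
    """
    n_args = len(objects) - 1
    # Normalized name: every object reference becomes a '$' placeholder.
    name = f"$.{query}" + (f"({', '.join(['$'] * n_args)})" if n_args else "")
    vars_and_types = [field for pair in objects for field in pair]
    return (tid, name, *vars_and_types, value, kind)


# A query `has` evaluated on a list with itself as argument, returning False in the pre-state.
print(predicate_row(1, "has", [(0, "LIST"), (0, "LIST")], False, "pre"))
```

Because the name is normalized, all evaluations of the same query share one textual key regardless of which objects were involved, which keeps lookups by predicate simple.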

Consider, for example, the test case in the running
example (Section III-B): l1.extend (l1). Table
TestCases stores a tuple $\langle id, \text{LIST}, \text{extend}, \pi, \Pi \rangle$, where id
is the unique identifier, $\pi$ is the list of (serialized) objects
in the pre-state, and $\Pi$ is the list of (serialized) objects
in the post-state. The test case induces, among others, the
evaluation of the query l1.has (l1) in the pre-state, which table
Predicates_2 stores as the tuple

$\langle id, \texttt{\$.has(\$)}, 0, \text{LIST}, 0, \text{LIST}, \text{False}, \text{pre} \rangle$

where the entries in attributes var_0 and var_1 refer to the
first element in the 0-indexed list $\pi$ of serialized objects (i.e.,
object l1 in serialized form).

B. Querying the database

The translation of predicates into SQL queries to the 
object/transition database is straightforward: 

• Objects become variables in the SELECT clause; 

• These variables are joined with the Predicates and
TestCases tables in the FROM clause;

• The WHERE clause encodes the constraints on the 
individual predicates, and SQL Boolean operators map 
the Boolean connectives in the translated predicate. 
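
The three rules can be sketched for the simplest case, a conjunction of argumentless Boolean queries with one literal per object. The function `to_sql` below is a hypothetical illustration, not the translator used in the implementation; it interpolates names directly into the query text, which is acceptable only because the inputs here are trusted identifiers.

```python
from typing import List, Tuple

# One literal: (query name, dynamic type, expected Boolean value); literal i constrains object i.
Literal = Tuple[str, str, bool]


def to_sql(literals: List[Literal], kind: str = "pre") -> str:
    """Translate a conjunction of argumentless Boolean queries into an SQL query."""
    if kind not in ("pre", "post"):
        raise ValueError("kind must be 'pre' or 'post'")
    objs, idxs, tables, where = [], [], [], []
    for i, (query, type_name, expected) in enumerate(literals, start=1):
        # Rule 1: each object becomes a pair of output columns in SELECT.
        objs.append(f"t{i}.{kind}_serialized AS objs{i}")
        idxs.append(f"p{i}.var_0 AS idx{i}")
        # Rule 2: each object joins a Predicates table with TestCases in FROM.
        tables.append(f"Predicates_1 p{i} JOIN TestCases t{i} ON p{i}.tid = t{i}.tid")
        # Rule 3: WHERE encodes the constraint on the individual predicate.
        value = f"p{i}.ret_value" if expected else f"NOT (p{i}.ret_value)"
        where.append(f"p{i}.name = '$.{query}' AND p{i}.type_0 = '{type_name}' "
                     f"AND {value} AND p{i}.kind = '{kind}'")
    return (f"SELECT {', '.join(objs + idxs)} "
            f"FROM {', '.join(tables)} "
            f"WHERE {' AND '.join(where)}")


print(to_sql([("is_empty", "LIST", True), ("is_empty", "LIST", False)]))
```

Predicates with arguments would additionally need the Predicates_2 to Predicates_9 tables and constraints tying the var_i columns of different literals to the same object, which this sketch leaves out.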

Let us demonstrate the creation of SQL queries with the
example at the end of Section III-E3, where we search for
two objects l1, l2 of type LIST such that l1.is_empty and
not l2.is_empty. We create the SQL query in Listing 2, which
searches for such objects in pre-states (the query for
post-states is analogous). The SQL query returns a tuple objs1,
objs2, idx1, idx2 such that objs1, objs2 are collections of
serialized objects and idx1, idx2 are integer indices: the
idx1-th element in collection objs1 is an empty list, and the
idx2-th element in collection objs2 is a non-empty list.



Listing 2. An SQL query searching for two lists.

SELECT
    t1.pre_serialized AS objs1, t2.pre_serialized AS objs2,
    p1.var_0 AS idx1, p2.var_0 AS idx2
FROM
    Predicates_1 p1 JOIN TestCases t1 ON p1.tid = t1.tid,
    Predicates_1 p2 JOIN TestCases t2 ON p2.tid = t2.tid
WHERE
    p1.name = '$.is_empty' AND
    p1.type_0 = 'LIST' AND
    p1.ret_value AND p1.kind = 'pre' AND
    p2.name = '$.is_empty' AND
    p2.type_0 = 'LIST' AND
    NOT (p2.ret_value) AND p2.kind = 'pre'
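
To see the query at work, the following Python sketch runs it against a toy in-memory SQLite database. The choice of SQLite, the column types, and the two sample rows are assumptions made for this example (the paper does not say which database engine is used), and the serialized strings are placeholders rather than AutoTest's real encoding.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TestCases (tid INTEGER PRIMARY KEY, class TEXT, routine TEXT,
                        pre_serialized TEXT, post_serialized TEXT);
CREATE TABLE Predicates_1 (tid INTEGER REFERENCES TestCases(tid), name TEXT,
                           var_0 INTEGER, type_0 TEXT, ret_value INTEGER, kind TEXT);
""")
# Toy data: test case 1 starts from an empty list, test case 2 from a non-empty one.
conn.executemany("INSERT INTO TestCases VALUES (?, ?, ?, ?, ?)", [
    (1, "LIST", "extend", "<empty list>", "<list [x]>"),
    (2, "LIST", "extend", "<list [y]>", "<list [y, z]>"),
])
conn.executemany("INSERT INTO Predicates_1 VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "$.is_empty", 0, "LIST", 1, "pre"),
    (2, "$.is_empty", 0, "LIST", 0, "pre"),
])

QUERY = """
SELECT t1.pre_serialized AS objs1, t2.pre_serialized AS objs2,
       p1.var_0 AS idx1, p2.var_0 AS idx2
FROM Predicates_1 p1 JOIN TestCases t1 ON p1.tid = t1.tid,
     Predicates_1 p2 JOIN TestCases t2 ON p2.tid = t2.tid
WHERE p1.name = '$.is_empty' AND p1.type_0 = 'LIST'
  AND p1.ret_value AND p1.kind = 'pre'
  AND p2.name = '$.is_empty' AND p2.type_0 = 'LIST'
  AND NOT (p2.ret_value) AND p2.kind = 'pre'
"""
rows = conn.execute(QUERY).fetchall()
print(rows)
```

Each returned row pairs an empty list with a non-empty one; deserializing the element at idx1 in objs1 and the one at idx2 in objs2 would give the two inputs for the new test case.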



V. Evaluation 

This section presents the results of an experimental
evaluation, summarized in Table I: the leftmost part of the table
contains statistics about random testing, the middle part shows
the performance of stateful testing with preconditions, and the
rightmost part with postconditions.
TABLE I

Classes under test and results (column groups: random testing, stateful testing with preconditions, stateful testing with postconditions).



A. Experimental setup 

The experiments targeted 13 Eiffel classes implementing 
data structures from the libraries EiffelBase [8] (revision 506)
and Gobo [9] (revision 6665). Table I lists the size of each
class in lines of code (LOC) and public routines (#R).

Each session in the preparation of the original test suite 
with random testing ran on a Linux node with a 2.53 GHz 
Intel Nehalem quad-core CPU and 8 GB of memory. The 
other experiments (contract inference and stateful testing) ran 
on an Ubuntu machine with a 1.73 GHz Intel Core i7 CPU
and 8 GB of memory. The average speed of random testing, 
contract inference, and stateful testing on the two architectures 
is comparable. 

1) Random testing: To generate the original test suite T,
upon which stateful testing builds, AutoTest ran 30 sessions
of random testing for each of the 13 classes. A session lasts 80
minutes and initializes the pseudo-random number generator
with a new seed. The resulting 390 sessions totaled 520 hours of
testing and generated a test suite with 149,293 distinct test cases. The
test suite T revealed 95 distinct faults¹ (column #E of Table I).

2) Stateful testing running time: Dynamic contract inference.
AutoInfer processed the test suite T for 16 hours and
reported 1741 preconditions and 973 postconditions expressible
as implications, shown in columns #Tp and #Tq in Table I.
Manual inspection revealed that 1012 (58%) of the inferred
preconditions and 68 (7%) of the inferred postconditions
are unsound. Columns #Up and #Uq respectively report the
number of unsound pre and postconditions for each class.

Object/transition database construction. Constructing the 
object/transition database from T took 5 hours. The database 
contains about 3.5 million objects, 18.4 million predicate 
evaluations, and 68.8 thousand transitions, and occupies 3.4 
GB on disk. 

Reduction. Notice that querying the object/transition database
gives predictable results, hence the reduction is deterministic
and needs to run only once. Stateful testing ran for
15 hours trying to violate the inferred pre and postconditions.
The times (in minutes) spent on the pre and postconditions in
each class are shown in columns #Mp and #Mq of Table I. In
the experiments, every query times out after one minute.

¹ Two faults are distinct if they violate two different contract clauses.

B. Experimental results 

In all, stateful testing discovered 65 new faults in the classes 
under test, corresponding to a 68.4% improvement over the 
number of faults found by random testing, with only a 7% 
time overhead (36/520 hours). Columns #Ep and #Eq in 
Table I respectively show the number of new faults detected
while trying to violate the inferred pre and postconditions in 
each class. The performance in terms of number of unsound 
preconditions and postconditions detected is given below. 
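
The headline figures follow from the numbers reported in this section, as the short Python check below shows. Reading the 36 hours as the sum of contract inference (16 h), database construction (5 h), and reduction (15 h) is an interpretation that matches the reported total; the variable names are ours.

```python
# Figures reported in Section V.
classes, sessions_per_class, minutes_per_session = 13, 30, 80
random_testing_hours = classes * sessions_per_class * minutes_per_session / 60

inference_hours, database_hours, reduction_hours = 16, 5, 15
stateful_hours = inference_hours + database_hours + reduction_hours

faults_random = 95                 # faults found by random testing (#E)
faults_pre, faults_post = 57, 8    # new faults via pre- and postconditions (#Ep, #Eq)
new_faults = faults_pre + faults_post

overhead = stateful_hours / random_testing_hours
improvement = new_faults / faults_random
print(f"random testing: {random_testing_hours:.0f} h, stateful testing: {stateful_hours} h")
print(f"time overhead: {overhead:.1%}, new faults: {improvement:.1%}")
```

The overhead of about 6.9% is what the text rounds to 7%.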



Building upon random testing, stateful testing detected 
68.4% new faults in a fraction of the time. 



1) Unsound preconditions: Table II gives an account of the
most common structures of the inferred preconditions targeted
in the experiments. Stateful testing tried to invalidate the 1741
inferred preconditions for 8.2 hours (i.e., about 18 seconds
per precondition), following the technique in Section III-E1.
It successfully invalidated 1006 (99.4%) of the unsound
preconditions (column #Vp of Table I, which also reports the
percentages relative to column #Up), while exposing 57 new
faults (column #Ep).

TABLE II

Structure of inferred preconditions.

Structure                          Example                  #T
Reference equality                 o1 = o2                  154
Object equality                    o1.is_equal (o2)         329
Voidness check                     o /= Void                  7
Integer equality                   o.count =                377
Boolean query with arguments       o.has (v)                483
Boolean query without arguments    o.is_empty               356
Other (complex)                    o.full or i < l.count     35
Total                                                      1741



2) Unsound postconditions: Stateful testing tried to invalidate
the 973 inferred postconditions in implication form for
6 hours (i.e., about 23 seconds per postcondition), following
the technique in Section III-E3. It successfully invalidated
60 (88.2%) of the unsound postconditions (column #Vq of
Table I, which also reports the percentages relative to column
#Uq), while exposing 8 new faults (column #Eq).

3) Undetected unsound contracts: Stateful testing failed
to detect only 6 unsound preconditions (0.6% of the unsound
preconditions) and 8 unsound postconditions (11.8%). In all such cases,
no serialized objects were in a state violating the contract
(or sufficiently close to it), or the predicates provided an
abstraction of the object state that was too coarse-grained for
the desired objects to be identifiable.

Stateful testing increased the soundness of inferred 
contracts from 60.2% to 99.5%. 
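
One reading that reproduces these two percentages is sketched below in Python: before stateful testing, the sound contracts are those not found unsound by manual inspection; afterwards, only the unsound contracts that escaped detection still count against soundness. This interpretation is our assumption; the text does not spell out the formula.

```python
inferred = {"pre": 1741, "post": 973}    # inferred contracts (#Tp, #Tq)
unsound = {"pre": 1012, "post": 68}      # unsound by manual inspection (#Up, #Uq)
invalidated = {"pre": 1006, "post": 60}  # invalidated by stateful testing (#Vp, #Vq)

total = sum(inferred.values())
before = (total - sum(unsound.values())) / total
undetected = sum(unsound.values()) - sum(invalidated.values())
after = (total - undetected) / total
print(f"{total} contracts, {undetected} undetected unsound: {before:.1%} -> {after:.1%}")
```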

VI. Limitations and future work 

Stateful testing, and its current implementation, has some 
limitations to be addressed in future work. 

• As is customary in random testing [11], we have
evaluated stateful testing on classes implementing data
structures. This was also useful for comparison against
our previous experience with AutoTest [1]. Further
experiments will target different types of classes.

• Stateful testing can start from a test suite T generated
manually or with any technique, but all our experiments used
automatically generated test cases. Further experiments
will determine if the performance of stateful testing is
affected by how the original test suite is generated.

• The current implementation of stateful testing adopts the 
following heuristics when retrieving objects and tran- 
sitions from the database: (1) search for objects and 
consider the first 45 results; (2) if none of the 45 retrieved 
objects work, search for transitions and call them on the 
45 result objects; (3) if none of the transitions work, give 
up and move to the next reduction. This heuristic worked 
quite well in the experiments, but further experience will 
determine if it can be improved and how robust the results 
are with respect to this heuristic. 

• Future experiments will try to iterate the infer/reduce 
process on the new test suite generated with stateful 
testing. This will challenge the state of the art in dynamic 
invariant inference and is likely to suggest improvements 
to the techniques used in the process. 
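
The three-step retrieval heuristic described in the third item above can be sketched in Python as follows. The function `find_inputs`, the representation of transitions as plain functions, and the `works` callback are hypothetical simplifications; only the limit of 45 results and the order of the steps come from the text.

```python
from itertools import islice
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")
LIMIT = 45  # number of query results considered, as stated in the text


def find_inputs(objects: Iterable[T],
                transitions: Iterable[Callable[[T], T]],
                works: Callable[[T], bool]) -> Optional[T]:
    """Try retrieved objects first, then transitions applied to them, then give up."""
    candidates: List[T] = list(islice(objects, LIMIT))
    for obj in candidates:         # (1) objects retrieved from the database
        if works(obj):
            return obj
    for step in transitions:       # (2) transitions called on the same objects
        for obj in candidates:
            changed = step(obj)
            if works(changed):
                return changed
    return None                    # (3) give up and move to the next reduction


# Looking for an empty list: none is stored, but dropping the first item of [1] yields one.
print(find_inputs([[1], [2, 3]], [lambda lst: lst[1:]], lambda lst: not lst))
```

Bounding the candidates keeps each reduction cheap, at the price of missing suitable objects that a query returns beyond the first 45 results.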

VII. Related Work 

Xie and Notkin [5] first suggested a framework that combines
test-case generation and dynamic specification inference
with the goal of mutually enhancing their results.

Dallmeier et al. [12] implement Xie and Notkin's ideas
for typestate specifications (finite-state automata describing
abstract object states and transitions), and report an evaluation
showing that their technique builds more accurate specifications
and finds more errors injected in Java applications
than traditional dynamic analysis techniques [13]. Stateful
testing is based on the same principles, applied to contract
specifications, and extends them with the usage of a database
to improve the reuse of previous testing sessions and to build
new test cases. Typestates provide object-state abstractions
(based on argumentless Boolean queries and simple integer
partitioning) that are coarser-grained than the one deployed in
the present paper; consequently, building a typestate model by
exhaustive exploration is feasible, whereas our more detailed
model requires heuristics and an efficient search of serialized
objects to be built. We also provide an implementation and
experimental evaluation.

Stateful testing combines diverse techniques of program 
analysis; the rest of this section summarizes some represen- 
tative work involving these techniques. More comprehensive 
references are available in the bibliography of the cited work. 

A. Automated test-case generation 

Automated random testing [6] is now a well-understood
technique which, in spite of the simplicity of its underlying
ideas, is quite effective and can find subtle bugs [1]. Arcuri
et al.'s analysis of random testing [2] analytically confirms
the experimental results, and suggests that more sophisticated
test-case generation techniques are best deployed after random
testing exhausts its potential. Stateful testing is indeed applied
following random testing sessions, to reuse the objects generated
and find new inputs and errors.

Search-based test-case generation refines random testing
with goal-driven searches in the space of test cases;
McMinn [14] and Ali et al. [15] survey the state of the
art in search-based techniques. Test suite augmentation [16]
uses search-based techniques driven by coverage criteria, with
the purpose of adapting a regression test suite to changed
code. Genetic algorithms are a recurring choice to search
for test inputs; Tonella [17] first suggested the idea, and
Andrews et al. [18] show how to use genetic algorithms to
optimize the performance of standard random testing. Stateful
testing is also search-based, but the search takes place among
previously generated objects, and it is guided by contracts that
characterize an existing test suite to be improved.

Other refinements of random testing combine it with
white-box techniques such as symbolic execution (e.g., [19]),
or leverage the availability of formal specifications in various
forms (see Hierons et al. [20] for a survey). Stateful testing
also makes extensive usage of specifications in the form of
contracts, both inferred and written by programmers.

In previous work [21], we developed precondition satisfaction,
a search strategy that improves the selection of objects
to test routines with complex preconditions. This technique is
included in AutoTest, and all the random testing sessions that
preceded stateful testing (in the experiments of Section V)
deployed precondition satisfaction.

B. Dynamic specification inference 

Daikon [22] pioneered the dynamic inference of specifications
and program invariants, and showed that assertions
"guessed" based on a finite number of runs are often sound
with respect to generic executions. Since the first Daikon
release, dynamic inference has been applied to other
specification models (e.g., typestates [23]) and has improved its
accuracy (such as in our own AutoInfer [3]).
Gupta and Heidepriem [24] suggest improving the quality
of inferred contracts by using different test suites (i.e., based
on code coverage and invariant coverage), and by retaining
only the contracts that are inferred with both techniques. Fraser
and Zeller [25] simplify and improve test cases based on
mining recurring usage patterns in code bases; the simplified
tests are easier to understand and focus on common usage.
Other approaches to improve the quality of inferred contracts
combine static and dynamic techniques (e.g., [26], [27]).

Stateful testing leverages dynamic contract inference tech- 
niques, and tries to violate inferred contracts to explore new 
regions of the input state space; this not only improves the 
test suite, but it also detects many unsound contracts. Stateful 
testing also includes the results of some lightweight static 
analysis (based on the branching structure of routines) to 
gather more information about the available test suite. 

C. Search-based techniques 

The idea of constructing "semantic" databases, with a
uniform search interface to retrieve programs [28], program
elements [29], [30], or test cases [31] with specific
characteristics has recently been deployed, mostly to help developers
organize their code and reuse products written by others.

The object capture technique [32] stores serialized objects,
created during program executions, and reuses them as new 
test inputs to reach uncovered branches. Stateful testing also 
stores serialized objects and reuses them to create new test 
cases; the object capture framework, however, only supports 
searching for objects based on their types, whereas stateful 
testing stores rich information about the objects' abstract 
states, as well as transitions between states. Another significant 
difference is that stateful testing targets the mutual improve- 
ment of test suites and inferred contracts, whereas object 
capture is only concerned with improving branch coverage. 

VIII. Conclusions 

This paper presented stateful testing, a completely auto- 
mated testing technique which generates new test cases from 
an existing test suite. Stateful testing works by trying to reduce 
(i.e., invalidate) the inferred contracts that characterize the 
existing test suite. Extensive experiments show that stateful 
testing is quite effective: it generates tests that uncover new 
faults and invalidates many of the unsound contracts inferred 
dynamically from the original test suite. 

Stateful testing is part of the automated testing framework
AutoTest; the source code of AutoTest (including the basic
testing infrastructure, dynamic contract inference, and the
stateful testing implementation), detailed experimental results,
and instructions to reproduce the experiments are available at:

http://se.inf.ethz.ch/research/autotest